Cloud Runner Improvements - LTS Candidate - S3 Locking, Aws Local Stack (Pipelines), Testing Improvements, Rclone storage support, Provider plugin system #731
- Implemented a primary attempt to pull LFS files using GIT_PRIVATE_TOKEN.
- Added a fallback mechanism to use GITHUB_TOKEN if the initial attempt fails.
- Configured git to replace SSH and HTTPS URLs with token-based authentication for the fallback.
- Improved error handling to log specific failure messages for both token attempts.
This change ensures more robust handling of LFS file retrieval in various authentication scenarios.

- Added permissions for packages, pull-requests, statuses, and id-token to enhance workflow capabilities.
- This change improves the CI pipeline's ability to manage pull requests and access necessary resources.

…ation
- Added configuration to use GIT_PRIVATE_TOKEN for git operations, replacing SSH and HTTPS URLs with token-based authentication.
- Improved error handling to ensure GIT_PRIVATE_TOKEN availability before attempting to pull LFS files.
- This change streamlines the process of pulling LFS files in environments requiring token authentication.

…entication
- Enhanced the process of configuring git to use GIT_PRIVATE_TOKEN and GITHUB_TOKEN by clearing existing URL configurations before setting new ones.
- Improved the clarity of the URL replacement commands for better readability and maintainability.
- This change ensures a more robust setup for pulling LFS files in environments requiring token authentication.

… pipeline
- Replaced instances of GITHUB_TOKEN with GIT_PRIVATE_TOKEN in the cloud-runner CI pipeline configuration.
- This change ensures consistent use of token-based authentication across various jobs in the workflow, enhancing security and functionality.

…L unsetting
- Modified the git configuration commands to append '|| true' to prevent errors if the specified URLs do not exist.
- This change enhances the reliability of the URL clearing process in the RemoteClient class, ensuring smoother execution during token-based authentication setups.

…tion
- Updated comments for clarity regarding the purpose of URL configuration changes.
- Simplified the git configuration commands by removing redundant lines while maintaining functionality for HTTPS token-based authentication.
- This change enhances the readability and maintainability of the RemoteClient class's git setup process.
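
For readers following the commits above, here is a minimal sketch of the token fallback and URL rewriting they describe. It is not the RemoteClient code, which is not shown in this conversation: the helper names `buildTokenUrlRewriteCommands` and `pullLfsWithFallback`, the `run` callback, and the exact set of rewritten URL prefixes are assumptions made for illustration. The only git feature relied on is the `url.<base>.insteadOf` configuration key.

```typescript
// Hypothetical helper: returns shell commands that route GitHub SSH/HTTPS URLs
// through token-authenticated HTTPS using git's url.<base>.insteadOf setting.
function buildTokenUrlRewriteCommands(token: string): string[] {
  if (token.trim() === '') {
    throw new Error('A non-empty token is required');
  }
  const authenticatedBase = `https://${token}@github.com/`;
  const key = `url."${authenticatedBase}".insteadOf`;

  return [
    // Clear earlier rewrites for this base; "|| true" keeps the step from
    // failing when nothing is configured yet.
    `git config --global --unset-all ${key} || true`,
    `git config --global --add ${key} "https://github.com/"`,
    `git config --global --add ${key} "ssh://git@github.com/"`,
    `git config --global --add ${key} "git@github.com:"`,
  ];
}

// Hypothetical helper: tries each token in order and returns the label of the
// first one for which `git lfs pull` succeeds. `run` is an injected command
// runner (an assumption) so the logic can be exercised without a real shell.
async function pullLfsWithFallback(
  tokens: { label: string; value?: string }[],
  run: (command: string) => Promise<void>,
): Promise<string> {
  const failures: string[] = [];
  for (const { label, value } of tokens) {
    if (!value) {
      failures.push(`${label}: not set`);
      continue;
    }
    try {
      for (const command of buildTokenUrlRewriteCommands(value)) {
        await run(command);
      }
      await run('git lfs pull');
      return label;
    } catch (error) {
      // Record the specific failure so both attempts show up in the final error.
      failures.push(`${label}: ${error instanceof Error ? error.message : String(error)}`);
    }
  }
  throw new Error(`git lfs pull failed (${failures.join('; ')})`);
}
```

Passing `GIT_PRIVATE_TOKEN` first and `GITHUB_TOKEN` second in `tokens` would reproduce the primary/fallback order the first commit describes.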
# Conflicts: # dist/index.js # dist/index.js.map # jest.config.js # yarn.lock
… log lines for test assertions
…off; lint/format fixes
… cache key for retained workspace (#379)
…logs; tests: retained workspace AWS assertion (#381)
…rd local provider steps
…nd log management; update builder path logic based on provider strategy
…sed on provider strategy and credentials; update binary files
…ained markers; hooks: include AWS S3 hooks on aws provider
…t:ci script; fix(windows): skip grep-based version regex tests; logs: echo CACHE_KEY/retained markers; hooks: include AWS hooks on aws provider
… update binary files
…rintf for empty input
…I hangs; s3 steps pass again
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/model/build-parameters.ts`:
- Line 213: The cloneDepth assignment should validate and sanitize
CloudRunnerOptions.cloneDepth before use: parse it with Number.parseInt, then if
Number.isNaN(parsed) or parsed < 0 (or not an integer) replace it with a
sensible default (e.g. a DEFAULT_CLONE_DEPTH constant or 1); update the
cloneDepth property assignment in build-parameters.ts (the cloneDepth field and
CloudRunnerOptions.cloneDepth reference) to use the validated/fallback value so
downstream git operations never receive NaN or a negative depth.
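
A minimal sketch of the validation suggested above, under stated assumptions: `sanitizeCloneDepth` and `DEFAULT_CLONE_DEPTH` are hypothetical names, and the default of 1 is only the example value from the comment. The digit-only check is an extra precaution beyond the suggestion, because `Number.parseInt` alone would accept inputs such as `"10abc"`.

```typescript
// Hypothetical default; the review comment only suggests "a sensible default".
const DEFAULT_CLONE_DEPTH = 1;

// Returns a non-negative integer clone depth, or the fallback for anything else.
function sanitizeCloneDepth(raw: string | undefined, fallback: number = DEFAULT_CLONE_DEPTH): number {
  const trimmed = (raw ?? '').trim();

  // Accept only plain base-10 digits so "-3", "1.5" and "10abc" are rejected.
  if (!/^\d+$/.test(trimmed)) {
    return fallback;
  }

  const parsed = Number.parseInt(trimmed, 10);

  // Guard against absurdly long digit strings that lose integer precision.
  return Number.isSafeInteger(parsed) ? parsed : fallback;
}
```

A value of `0` is kept as valid here because the clone code quoted later in this review treats `'0'` as "no `--depth` argument".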
In `@src/model/cloud-runner/providers/aws/aws-cloud-formation-templates.ts`:
- Around line 22-26: The getSecretDefinitionTemplate function currently returns
a snippet starting with the top-level "Secrets:" key which causes duplicate YAML
keys when called per-secret; change getSecretDefinitionTemplate to return only
the list item block (the "- Name: '...'\n ValueFrom: !Ref ...") without the
"Secrets:" header, and update the caller loop that inserts these snippets to
either (a) create a single "Secrets:" header once and concatenate all list-item
snippets under it, or (b) emit the header on the first insertion only and append
subsequent list items — locate getSecretDefinitionTemplate to modify its return
string and the code that calls insertAtTemplate('p3 - container def', ...) to
ensure the "Secrets:" header is produced exactly once.
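
One way to picture option (a) is the sketch below. It is an illustration only: `SecretReference`, `buildSecretsBlock`, and the indentation handling are assumptions, and the real template code and `insertAtTemplate` call are not reproduced here.

```typescript
// Hypothetical shape for one secret: the container-visible name and the
// CloudFormation logical ID that `!Ref` points at.
interface SecretReference {
  name: string;
  parameterName: string;
}

// Builds a YAML fragment with exactly one "Secrets:" key followed by one list
// item per secret. Returns an empty string when there is nothing to emit.
function buildSecretsBlock(secrets: SecretReference[], indent: string = ''): string {
  if (secrets.length === 0) {
    return '';
  }

  const items = secrets.map(
    (secret) =>
      `${indent}  - Name: '${secret.name}'\n${indent}    ValueFrom: !Ref ${secret.parameterName}`,
  );

  return [`${indent}Secrets:`, ...items].join('\n');
}
```

Inserting the result of `buildSecretsBlock` once, instead of inserting a per-secret snippet in a loop, is what prevents the duplicate key.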
In `@src/model/image-tag.ts`:
- Line 41: Validate containerRegistryImageVersion before assigning to
this.imageRollingVersion by checking it matches a Docker tag-safe pattern (e.g.,
start with an alphanumeric/underscore and only contain alphanumerics, dots,
underscores or dashes, max length ~128) and reject values containing disallowed
characters like '/' or spaces; update the assignment site (the code where
this.imageRollingVersion = containerRegistryImageVersion) to perform this check
and throw or return a clear error when the value is invalid so failures are
explicit and early.
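
The check described above might look like the following sketch. `DOCKER_TAG_PATTERN` and `assertValidImageTag` are hypothetical names; the pattern encodes the rule stated in the comment (first character alphanumeric or underscore, then alphanumerics, dots, underscores, or dashes, at most 128 characters).

```typescript
// First character: letter, digit or underscore; up to 127 further characters
// from letters, digits, underscore, dot and dash (128 characters in total).
const DOCKER_TAG_PATTERN = /^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$/;

// Returns the value unchanged when it is tag-safe, otherwise throws early
// with a message that names the offending input.
function assertValidImageTag(value: string): string {
  if (!DOCKER_TAG_PATTERN.test(value)) {
    throw new Error(`Invalid image version "${value}": expected a Docker tag-safe value`);
  }

  return value;
}
```

Dot versions such as `3.1.0` pass this check, while values containing `/` or spaces are rejected before they reach the image reference.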
🧹 Nitpick comments (2)
src/model/cloud-runner/remote-client/index.ts (2)
213-213: Simplify async return statement.
The function is already async, so wrapping the return value in a Promise constructor is unnecessary. This can be simplified.
♻️ Suggested simplification
- return new Promise((result) => result(``));
+ return ``;
309-311: Address static analysis hints for naming and formatting.
ESLint flags the variable name `depthArg` (should be `depthArgument`) and Prettier flags the line formatting.
♻️ Suggested fix
- const depthArg = CloudRunnerOptions.cloneDepth !== '0' ? `--depth ${CloudRunnerOptions.cloneDepth}` : '';
+ const depthArgument = CloudRunnerOptions.cloneDepth !== '0' ? `--depth ${CloudRunnerOptions.cloneDepth}` : '';
  await CloudRunnerSystem.Run(
-   `git clone ${depthArg} ${CloudRunnerFolders.targetBuildRepoUrl} ${path.basename(CloudRunnerFolders.repoPathAbsolute)}`.trim(),
+   `git clone ${depthArgument} ${CloudRunnerFolders.targetBuildRepoUrl} ${path.basename(
+     CloudRunnerFolders.repoPathAbsolute,
+   )}`.trim(),
  );
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/model/cloud-runner/providers/aws/index.ts (1)
163-171: Unconditional wait for cleanup stack deletion when it may not exist.
The `DeleteStackCommand` for the cleanup stack (line 151) is only sent when `CloudRunnerOptions.useCleanupCron` is `true`, but this `waitUntilStackDeleteComplete` for the cleanup stack is always executed. If `useCleanupCron` is `false`, the cleanup stack was never created, so this wait will either fail or time out unnecessarily.
🔧 Proposed fix
  await waitUntilStackDeleteComplete(
    {
      client: CF,
      maxWaitTime: stackWaitTimeSeconds,
    },
    {
      StackName: taskDef.taskDefStackName,
    },
  );
+ if (CloudRunnerOptions.useCleanupCron) {
    await waitUntilStackDeleteComplete(
      {
        client: CF,
        maxWaitTime: stackWaitTimeSeconds,
      },
      {
        StackName: `${taskDef.taskDefStackName}-cleanup`,
      },
    );
+ }
src/model/cloud-runner/remote-client/index.ts (1)
363-363: Fix contradictory assertion message.
Line 363 asserts `fs.existsSync(path.join('.git', 'lfs'))` but the message says "LFS folder should not exist before caching". Either the assertion or message is incorrect.
If LFS folder should NOT exist at this point:
- assert(fs.existsSync(path.join(`.git`, `lfs`)), 'LFS folder should not exist before caching');
+ assert(!fs.existsSync(path.join(`.git`, `lfs`)), 'LFS folder should not exist before caching');
🤖 Fix all issues with AI agents
In `@src/model/cloud-runner/providers/k8s/kubernetes-task-runner.ts`:
- Line 154: The checks inside KubernetesTaskRunner that inspect error?.message
should use the normalized errorMessage variable (set earlier) instead of
accessing error.message directly; update both occurrences (the condition near
the isRunning/continueStreaming branch and the later check around line 204) to
use a safe include against errorMessage (e.g. (errorMessage ||
'').includes('previous terminated container')) so string rejections from
CloudRunnerSystem.Run are handled consistently and won't silently fail.
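
As a sketch of the normalization this comment relies on (the helper names `toErrorMessage` and `isPreviousContainerError` are hypothetical, and the actual `errorMessage` assignment in KubernetesTaskRunner is not shown here), string rejections and `Error` objects can be reduced to one string before any `includes` check:

```typescript
// Reduces any thrown or rejected value to a plain string.
function toErrorMessage(error: unknown): string {
  if (typeof error === 'string') {
    return error;
  }
  if (error instanceof Error) {
    return error.message;
  }
  if (typeof error === 'object' && error !== null && 'message' in error) {
    return String((error as { message: unknown }).message);
  }

  // null and undefined become an empty string rather than "null"/"undefined".
  return String(error ?? '');
}

// Safe check that works for both string rejections and Error instances.
function isPreviousContainerError(error: unknown): boolean {
  return toErrorMessage(error).includes('previous terminated container');
}
```

With this shape, a rejection such as `Promise.reject('... previous terminated container ...')` is matched the same way as a thrown `Error`, which is the inconsistency the comment points out.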
🧹 Nitpick comments (12)
src/model/cloud-runner/tests/cloud-runner-s3-steps.test.ts (3)
22-24: Remove or replace trivial test with meaningful assertion.
This test asserts `true === true`, which provides no validation. If the intent is to verify the file loads without errors, the import/parse step already does that. Consider removing this test or replacing it with a meaningful assertion (e.g., verifying exported functions exist).
72-126: Consider extracting repeated credential setup in customJob YAML.
The AWS credential configuration block is duplicated three times across `test-s3-pull-cache`, `test-s3-upload-cache`, and `test-s3-upload-build` steps. This increases maintenance burden and risk of inconsistency.
Consider moving credential setup to a shared initialization step or relying on environment variables passed to containers rather than repeating `aws configure` in each step.
134-135: Remove redundant `shouldRunS3` check.
This code path is only reachable when `shouldRunS3` is `true` (guarded at line 43), making the inner check redundant.
- // Only run S3 operations if environment supports it
- if (shouldRunS3) {
+ // Run S3 verification operations
+ {
src/model/cloud-runner/providers/aws/aws-job-stack.ts (1)
21-30: Extract `getStackWaitTime()` to a shared module to avoid duplication.
This function and `DEFAULT_STACK_WAIT_TIME_SECONDS` are duplicated identically in three files: `aws-job-stack.ts`, `aws-base-stack.ts`, and `index.ts`. Consider extracting them to `aws-client-factory.ts` or a dedicated utility module.
♻️ Example consolidation
In `aws-client-factory.ts`:
  const DEFAULT_STACK_WAIT_TIME_SECONDS = 600;
  export function getStackWaitTime(): number {
    const overrideValue = Number(process.env.CLOUD_RUNNER_AWS_STACK_WAIT_TIME ?? '');
    if (!Number.isNaN(overrideValue) && overrideValue > 0) {
      return overrideValue;
    }
    return DEFAULT_STACK_WAIT_TIME_SECONDS;
  }
Then import it in the other files:
  import { getStackWaitTime } from './aws-client-factory';
src/model/cloud-runner/providers/aws/index.ts (1)
105-109: Consider removing or documenting the unused factory method calls.
`getECS()` and `getKinesis()` are called but their return values are discarded. If the intent is to pre-initialize clients for early failure detection, add a comment explaining this. Otherwise, remove these calls since the factory will lazily initialize clients when actually needed.
  ResourceTracking.logAllocationSummary('aws workflow');
  await ResourceTracking.logDiskUsageSnapshot('aws workflow (host)');
- AwsClientFactory.getECS();
  const CF = AwsClientFactory.getCloudFormation();
- AwsClientFactory.getKinesis();
src/model/cloud-runner/providers/k8s/index.ts (2)
162-200: Test cleanup block uses dynamic import unnecessarily.
The dynamic import of `CloudRunnerSystem` at line 168-170 is unnecessary since `CloudRunnerSystem` is already imported at the module level (line 19). While past reviews marked this as addressed, the dynamic import pattern persists. Use the existing module-level import directly.
♻️ Suggested fix
  if (process.env['cloudRunnerTests'] === 'true') {
    try {
      CloudRunnerLogger.log('Cleaning up old images in k3d node before pulling new image...');
-     const { CloudRunnerSystem: CloudRunnerSystemModule } = await import(
-       '../../services/core/cloud-runner-system'
-     );
      // Aggressive cleanup: remove stopped containers and non-Unity images
      // ... rest of code using CloudRunnerSystem instead of CloudRunnerSystemModule
And update line 188:
- await CloudRunnerSystemModule.Run(cmd, true, true);
+ await CloudRunnerSystem.Run(cmd, true, true);
206-282: Redundant dynamic import in image cache validation block.
Similar to the cleanup block, line 208-210 dynamically imports `CloudRunnerSystem` when it's already available at module scope. This adds unnecessary overhead and complexity.
♻️ Suggested fix
  if (process.env['cloudRunnerTests'] === 'true' && image.includes('unityci/editor')) {
    try {
-     const { CloudRunnerSystem: CloudRunnerSystemModule2 } = await import(
-       '../../services/core/cloud-runner-system'
-     );
-     // Check if image is cached on agent node (where pods run)
-     const agentImageCheck = await CloudRunnerSystemModule2.Run(
+     const agentImageCheck = await CloudRunnerSystem.Run(
        `docker exec k3d-unity-builder-agent-0 sh -c "crictl images | grep -q unityci/editor && echo 'cached' || echo 'not_cached'" || echo 'not_cached'`,
Apply similar changes to all `CloudRunnerSystemModule2.Run` calls in this block.
src/model/cloud-runner/tests/e2e/cloud-runner-end2end-retaining.test.ts (1)
141-170: Cleanup logic is thorough but duplicates earlier block.
The cache cleanup logic (lines 141-170) largely duplicates the workspace cleanup logic (lines 111-138). Consider extracting a shared helper function to reduce duplication.
♻️ Suggested refactor pattern
  async function safeCleanupDirectory(dirPath: string, logger: typeof CloudRunnerLogger) {
    if (!fs.existsSync(dirPath)) return;
    try {
      logger.log(`Cleaning up: ${dirPath}`);
      await CloudRunnerSystem.Run(`chmod -R u+w ${dirPath} 2>/dev/null || true`);
      await CloudRunnerSystem.Run(`rm -rf ${dirPath} 2>/dev/null || true`);
      await CloudRunnerSystem.Run(`find ${dirPath} -type f -delete 2>/dev/null || true`);
      await CloudRunnerSystem.Run(`find ${dirPath} -type d -empty -delete 2>/dev/null || true`);
    } catch (error: any) {
      logger.log(`Failed to cleanup: ${error.message}`);
    }
  }
src/model/cloud-runner/cloud-runner.ts (1)
156-168: Dynamic provider loading with fallback is acceptable but consider logging level.
The dynamic provider loading approach with fallback to LocalCloudRunner is reasonable for extensibility. However, silently falling back could mask configuration errors in production.
Consider using `CloudRunnerLogger.logWarning` instead of `CloudRunnerLogger.log` for the fallback message to make it more visible.
src/model/cloud-runner/tests/e2e/cloud-runner-end2end-caching.test.ts (1)
102-136: Cleanup logic duplicates the retaining test pattern.
This cleanup block is nearly identical to the one in `cloud-runner-end2end-retaining.test.ts`. Consider extracting a shared test utility for cache cleanup to reduce duplication across e2e tests.
src/model/cloud-runner/remote-client/index.ts (1)
397-400: Remove unconventional `void 0` statement.
The `void 0;` statement at line 399 is unconventional for a no-op. The comment alone suffices, or use an empty block.
  } catch {
    /* no-op: best-effort git lfs pull without tokens may fail */
-   void 0;
  }
src/model/cloud-runner/providers/k8s/kubernetes-storage.ts (1)
61-75: Avoid creating redundant KubeConfig instances; reuse the initialized client's kubeconfig.
While both `kubeClient` and the independently created `KubeConfig` at line 62 use `loadFromDefault()` and thus load the same default cluster config, creating separate KubeConfig and `StorageV1Api` instances each time this method is called is inefficient and violates DRY.
Instead, add `StorageV1Api` to the Kubernetes class initialization (in `index.ts`), or pass the `KubeConfig` instance as a parameter to this method to reuse it. The same pattern is repeated at line 187.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/model/docker.ts`:
- Around line 58-60: The image detection logic (isAlpineBasedImage /
commandPrefix) misses tagged and registry-qualified refs (e.g., alpine:3.19 or
ghcr.io/rclone/rclone:1.66); update detection to normalize or pattern-match the
image string instead of exact equality: extract the repository/name portion or
use a regex to match (^(.*/)?alpine(:|$) or (^(.*/)?rclone/rclone(:|$)) so
images with tags or registry prefixes are detected as Alpine-based, then set
commandPrefix to /bin/sh accordingly; update the comment placement to include
the required blank line before it.
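
A sketch of the pattern-based detection suggested above. The two patterns follow the regexes given in the comment, extended with `@` so digest references also match (an addition of this sketch). `getCommandPrefix` and the `/bin/bash` default for other images are assumptions, since the surrounding code in `docker.ts` is not shown.

```typescript
// Images known to ship only /bin/sh. Each pattern allows an optional registry
// or namespace prefix and an optional tag (":") or digest ("@") suffix.
const SH_ONLY_IMAGE_PATTERNS: RegExp[] = [
  /^(.*\/)?alpine([:@]|$)/,
  /^(.*\/)?rclone\/rclone([:@]|$)/,
];

function isAlpineBasedImage(image: string): boolean {
  const reference = image.trim();

  return SH_ONLY_IMAGE_PATTERNS.some((pattern) => pattern.test(reference));
}

// Assumption: every other image is expected to provide bash.
function getCommandPrefix(image: string): string {
  return isAlpineBasedImage(image) ? '/bin/sh' : '/bin/bash';
}
```

This deliberately matches on the repository name only, so an image such as `alpine-node:1` is not treated as Alpine-based; a list of known names is a heuristic, not a guarantee about the image contents.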
🧹 Nitpick comments (1)
src/model/cloud-runner/providers/docker/index.ts (1)
140-141: Add blank line before the comment to satisfy ESLint.
Static analysis reports a missing blank line before the comment.
🔧 Proposed fix
  const entrypointFilePath = `start.sh`;
+
  // Use #!/bin/sh for POSIX compatibility (Alpine-based images like rclone/rclone don't have bash)
  const fileContents = `#!/bin/sh
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/model/cloud-runner/providers/docker/index.ts`:
- Around line 97-106: The set endpointEnvironmentNames in
src/model/cloud-runner/providers/docker/index.ts is missing the INPUT_* variants
for CloudFormation, ECS, Kinesis, and CloudWatch Logs; update the Set used in
the localhost rewriting logic (symbol: endpointEnvironmentNames) to include
'INPUT_AWSCLOUDFORMATIONENDPOINT', 'INPUT_AWSECSENDPOINT',
'INPUT_AWSKINESISENDPOINT', and 'INPUT_AWSCLOUDWATCHLOGSENDPOINT', and apply the
same additions to the equivalent sets/logic in
src/model/cloud-runner/providers/k8s/kubernetes-job-spec-factory.ts and
src/model/cloud-runner/providers/aws/aws-task-runner.ts so containers using
INPUT_* endpoint env vars are rewritten to host.docker.internal consistently.
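
To illustrate the rewrite this comment refers to, here is a minimal sketch. Only the four `INPUT_*` names listed above are included; the existing entries of `endpointEnvironmentNames` are not shown in this conversation and are left out. `rewriteLocalhostEndpoint` is a hypothetical helper name, and the exact matching rules in the providers may differ.

```typescript
// Only the names proposed in the review comment; the real set has more entries.
const endpointEnvironmentNames = new Set<string>([
  'INPUT_AWSCLOUDFORMATIONENDPOINT',
  'INPUT_AWSECSENDPOINT',
  'INPUT_AWSKINESISENDPOINT',
  'INPUT_AWSCLOUDWATCHLOGSENDPOINT',
]);

// Rewrites http(s)://localhost or 127.0.0.1 to host.docker.internal for known
// endpoint variables, so a container can reach a service on the host.
function rewriteLocalhostEndpoint(name: string, value: string): string {
  if (!endpointEnvironmentNames.has(name)) {
    return value;
  }

  // The lookahead keeps hosts like "localhost.example.com" untouched.
  return value.replace(/^(https?:\/\/)(localhost|127\.0\.0\.1)(?=[:/]|$)/i, '$1host.docker.internal');
}
```

Keeping the name set in one shared module would also address the consistency concern across the docker, k8s, and aws providers.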
Reverts cosmetic changes that renamed workflow_id to workflowId in GitHub API calls. The GitHub REST API uses workflow_id, so we keep the eslint camelcase suppression comments to match the official API convention. Also restores the getCheckStatus() method that was removed.
Co-Authored-By: Claude Opus 4.5 <[email protected]>

…s, versioning.test.ts
These files had changes unrelated to the Cloud Runner improvements PR goals. Reverting to main branch state.
Co-Authored-By: Claude Opus 4.5 <[email protected]>

…ovider
The rclone/rclone image is Alpine-based and only has /bin/sh, not /bin/bash. This fixes exit code 127 errors when running rclone commands in containers.
Co-Authored-By: Claude Opus 4.5 <[email protected]>

The previous implementation fetched ALL PR refs with:
  git fetch origin +refs/pull/*:refs/remotes/origin/pull/*
This is extremely slow for repos with many PRs (700+ PRs in unity-builder). Now fetches only the specific PR ref needed, e.g., for pull/731/merge:
  git fetch origin +refs/pull/731/merge:... +refs/pull/731/head:...
This should significantly speed up the Cloud Runner integrity tests.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
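
The targeted fetch described in the commit above can be sketched as follows. The destination refs are elided in the commit message, so this sketch assumes the same `refs/remotes/origin/pull/` namespace as the old wildcard refspec; `buildPullRequestFetchCommand` is a hypothetical helper name, not the function used in the repository.

```typescript
// Builds a fetch command for a single pull request's merge and head refs.
// Returns undefined when the branch name is not a pull request ref.
function buildPullRequestFetchCommand(branch: string): string | undefined {
  const match = /^(?:refs\/)?pull\/(\d+)\/(?:merge|head)$/.exec(branch);
  if (!match) {
    return undefined;
  }

  const prNumber = match[1];

  // Assumed destination namespace, mirroring the previous wildcard refspec.
  const refspecs = ['merge', 'head'].map(
    (kind) => `+refs/pull/${prNumber}/${kind}:refs/remotes/origin/pull/${prNumber}/${kind}`,
  );

  return `git fetch origin ${refspecs.join(' ')}`;
}
```

Fetching two explicit refspecs instead of the `refs/pull/*` wildcard keeps the transfer independent of how many pull requests the repository has.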
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Tests are already covered by cloud-runner-integrity.yml
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
Major improvements to Cloud Runner with LocalStack support, rclone storage provider, dynamic provider plugin system, and enhanced CI testing capabilities.
I have contacted LocalStack to regain access to the ECS mocking functionality; for now, I am mocking it myself with local-docker for AWS workflows.
Changes
New Features
`AWS_FORCE_PROVIDER=aws-local` validates AWS CloudFormation templates while executing via local-docker (no LocalStack Pro required)
New Action Inputs
resourceTracking
awsEndpoint
awsCloudFormationEndpoint
awsEcsEndpoint
awsKinesisEndpoint
awsCloudWatchLogsEndpoint
awsS3Endpoint
storageProvider: `s3` (default) or `rclone`
rcloneRemote (… myremote:bucket/path)
cloneDepth
cloudRunnerRepoName
Improvements
Bug Fixes (includes #686)
… `Secrets:` block)
`imageRollingVersion` now supports dot versions (e.g., "3.1.0")
Testing
… jest.ci.config.js)
Documentation
… src/model/cloud-runner/providers/README.md)
CI Testing Modes
AWS_FORCE_PROVIDER=aws
AWS_FORCE_PROVIDER=aws-local
Related PRs
Checklist
Summary by CodeRabbit
New Features
Enhancements
Chores